(54) World wide web news retrieval system

(57) A World Wide Web site data retrieval system includes an input device for inputting data and
commands to access the World Wide Web, and a memory for storing a Web site data retrieval driver which
includes a Web reader, stored Web site address information, stored Web site commands, and stored format
information. The memory also stores process steps to connect to a Web site and to issue commands within
the connected Web site, and a connection to the World Wide Web. The system includes a processor for
launching the Web site data retrieval driver in response to a command to access the World Wide Web. The
Web site retrieval driver, upon being launched, (1) launches the Web reader to connect to the World Wide
Web via the connection, (2) retrieves the Web site address information and Web site commands, (3)
instructs the Web reader to access the Web site based on the Web site address information and Web site
commands, (4) downloads Web site data from the Web site based on the Web site commands, (5) stores the
Web site data in a linear document, (6) repeats steps 1 through 5 until all addresses in the stored Web
site address information have been accessed, and (7) formats the linear document into a personalized
document based on the format information.
Description

BACKGROUND OF THE INVENTION 
Field Of The Invention

The invention relates to a data retrieval system which automatically traverses hypermedia documents on a
computer network and automatically retrieves information from those documents based on a match between
the structure of the documents and a personalized data retrieval structure. More particularly, the
invention can retrieve articles from a news service, from a magazine service, or from a combination of
both services which are located on the World Wide Web, a private computer network that supports
hypermedia links, or any other hypermedia-linked computer system.

For example, there exists a Web site for retrieving news articles from the New York Times and a Web site
for retrieving articles from People magazine. The retrieval system of the invention can traverse through
such Web sites and select articles based on a personalized data retrieval structure. The personalized
data retrieval structure may include commands to retrieve a full text of the front page only, headlines
of the business section, headlines of the stock section and sports section, etc. In addition, the
personalized data retrieval structure may include content-based rules to retrieve articles with certain
keywords, to exclude articles with certain keywords, or to include articles based on a rule-based
content analysis. The invention also provides a method for synthesizing all retrieved news articles and
printing the synthesized news articles into a newspaper-type format in which each of the articles is
arranged based on a user's predefined layout.

While the above example is in the context of the Web, hypermedia documents can reside on other types of
networks besides the Web, such as an intranet. An intranet is a private computer network that is not
connected to outside computer networks. For example, a company's own computer network could be an
intranet with hypermedia documents on it. For brevity, the following discussion is made with respect to
the World Wide Web. However, it should be understood that the invention applies equally well to any type
of computer network that contains hypermedia documents, such as an intranet, different hypermedia-linked
computer networks that reside on the Internet other than the Web, etc.

A hypermedia document on the Web can span multiple Web sites. Such documents can be newspapers, news
articles, magazines, catalogs, manuals, memoranda, and the like. For brevity, the following discussion
is made with respect to sources of news information. However, it should be understood that the invention
applies equally well to any other type of hypermedia document.

Description Of The Related Art

The World Wide Web is an on-line source of hypermedia documents containing hypermedia text and images
that act as links to other documents, Web sites, etc. As a result, documents on the Web are not
organized sequentially. Rather, a user is automatically linked to other documents or Web sites to
complete the viewing of a document by selecting a hypermedia link, such as a text link or an image link,
within the document. Accordingly, an entire document cannot be viewed by scrolling through text.

One popular use of the Web is on-line publication and distribution of magazines and newspapers.
Currently, many Web news services, such as the New York Times, allow the user to define keywords of
interest and to receive news information, daily or hourly, that contains text matching the keywords. The
news information can then be delivered to the user's computer via modem or E-mail. However, most Web
news site newspapers, like the New York Times, include too much information, most of which has no
interest to the user since the information is retrieved based only on a keyword match.

Other sources of news information are provided through information suppliers like "Individual Inc."
Individual Inc. supplies users with a brief summary of the top twenty most relevant articles based on a
user's predefined keywords. This subscription news service allows the user to specify five to ten areas
of interest based on keywords, which are then prioritized by the user. The information service searches
the Web for magazines and newspapers which contain any of the keywords. Based on the keyword searches,
twenty of the most relevant articles are selected, compiled into a brief one-page summary, and
transmitted to the user via facsimile for the user's review. However, in order to review an entire
document rather than the summary, the user must log onto a specific Web site containing the document in
order to retrieve and review the document.

There are yet other services which permit the user to personalize a newspaper to be displayed at the
user's terminal by storing links to various news articles from various news sources on the Web. For
example, CRAYON ("Create Your Own Newspaper") permits a user to select specific sections from among
links to over twenty-five different on-line newspapers, and to compose the selections into a
personalized newspaper. Using CRAYON, it is possible to compose a personalized newspaper containing, for
example, links to the international section of the New York Times, the business section of the Wall
Street Journal, and the sports section of the Chicago Tribune. The HTML (hypertext markup language)
source file for this newspaper is then stored to mass media storage for later use.

While the foregoing news and information services provide convenient ways to keep updated on the news,
they do not allow a user to access and view the news in the way that people naturally read a real-world
newspaper. Namely, people naturally read a newspaper by scanning the pages of sections that they find
interesting and then reading those articles that grab their attention. In other words, people use a
structural approach to decide what pages to look at initially (e.g., the first page of the Business and
World sections, and the comics page of the Arts section). They then scan the selected pages for
articles.

In sum, conventional news and information services do not allow a user to access data from a hypermedia
document on the basis of the structure of the document, and then to format that data in a manner that
allows the user to scan and read the data in a natural fashion.

SUMMARY OF THE INVENTION 

In accordance with one aspect, the invention addresses the above deficiencies in the art by accessing at
least one hypermedia document, retrieving data from the hypermedia document into an extracted data tree,
with the data retrieved based on a structure of the hypermedia document, flattening the extracted data
tree into a linear document, and formatting the linear document into a formatted document.

In another aspect, the invention creates a personal-news-profile for retrieving data from a
hypermedia-linked computer network. The hypermedia-linked computer network is accessed, a learning mode
is started, the hypermedia-linked computer network is traversed with commands, at least one rule is
extracted from the commands, and the rule(s) is compiled into the personal-news-profile.

In yet another aspect, the invention creates a personalization profile for a Web site data retrieval
system. Data and commands are input to access the World Wide Web and a connection is made to the World
Wide Web. A Web reader is launched, and the Web reader accesses the Web via the connection. In response
to user commands, a learning mode is entered. Commands are sent to traverse the World Wide Web, and at
least one rule is extracted from the commands. The rule(s) is compiled into a personalization profile,
which is stored.

In yet another aspect, the invention retrieves articles from a hypermedia-linked computer network and
formats the articles into a personalized newspaper. A stored personal-news-profile is retrieved. The
personal-news-profile includes address data for a site on the hypermedia-linked computer network,
command data for accessing data from the site, and newspaper layout commands. The site is accessed based
on address data stored in the personal-news-profile, and articles at the site are downloaded based on
command data stored in the personal-news-profile. The downloaded articles are flattened into a linear
document, and the linear document is formatted into the personalized newspaper according to newspaper
layout commands stored in the personal-news-profile.

In yet another aspect, the invention retrieves data from a World Wide Web site and formats the data into
a personalized document. A Web site data retrieval driver which includes a Web reader, stored Web site
address information, stored Web site commands, and stored format information is accessed. The invention
(1) launches the Web reader to connect to the World Wide Web via a connection to the Web, (2) retrieves
the Web site address information and Web site commands, (3) instructs the Web reader to access the Web
site based on the Web site address information and Web site commands, (4) downloads Web site data from
the Web site based on the Web site commands, wherein the data is downloaded with reference to a linked
list so as to avoid hypermedia links that form loops and so as to avoid repetitious downloading of data
that has already been downloaded, (5) stores the Web site data in a linear document, (6) repeats steps 2
through 5 until all addresses in the stored Web site address information have been accessed, and (7)
formats the linear document into the personalized document based on the format information.
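
The retrieval loop in steps (2) through (6) can be illustrated with a minimal Python sketch. It is
not the patent's implementation: the function name retrieve_linear_document, the fetch callback, the
max_depth limit, and the FAKE_WEB table are all hypothetical, and a Python set stands in for the
linked list that records which pages have already been downloaded.

```python
from typing import Callable, Iterable, List

def retrieve_linear_document(
    addresses: Iterable[str],
    fetch: Callable[[str], dict],
    max_depth: int = 2,
) -> List[dict]:
    """Visit each stored address, follow links, and append pages to a linear list."""
    visited = set()            # record of downloaded pages (stand-in for the linked list)
    linear: List[dict] = []    # the linear document, in download order

    def visit(url: str, depth: int) -> None:
        # Skip pages already downloaded (this also breaks loops) and overly deep links.
        if url in visited or depth > max_depth:
            return
        visited.add(url)
        page = fetch(url)
        linear.append({"url": url, "text": page["text"]})
        for link in page.get("links", []):
            visit(link, depth + 1)

    for address in addresses:
        visit(address, 0)
    return linear

# Hypothetical in-memory "Web" used in place of a real Web reader.
FAKE_WEB = {
    "home": {"text": "Home", "links": ["a", "b"]},
    "a": {"text": "Article A", "links": ["home"]},  # loop back to the homepage
    "b": {"text": "Article B", "links": ["a"]},     # second link to an article already seen
}

document = retrieve_linear_document(["home"], FAKE_WEB.__getitem__)
urls = [page["url"] for page in document]
```

In this sketch each page is downloaded once even though "home" and "a" are linked from more than one
place, which is the behavior the linked-list reference in step (4) is meant to guarantee.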

In yet another aspect, the invention accesses and retrieves data at World Wide Web sites and formats the
data into a personalized document. The invention connects to the World Wide Web, retrieves user defined
Web site address information, user defined Web site commands, and user defined formatting commands, and
activates a Web reader so as to access a Web site based on the user defined Web site address
information. The Web reader is used to download data from the Web based on the user defined Web site
commands, and the data is downloaded into an extracted data tree. The downloading continues until all
addresses in the user defined Web site address information have been accessed. The extracted data tree
is flattened into a linear document and the flattened document is formatted into the personalized
document based on the user defined formatting commands.

In yet another aspect, the invention retrieves news articles from on-line news services on the World
Wide Web and formats the news articles into a personalized newspaper. The invention stores a
personal-news-profile which comprises address data and command data for accessing data from a Web site
and newspaper format commands, retrieves the stored personal-news-profile and accesses the data stored
therein, activates a Web reader to contact a Web site based on address data stored in the
personal-news-profile, downloads news articles at the contacted Web site based on command data stored in
the personal-news-profile, stores the downloaded news articles, and formats the stored news articles
into the personalized newspaper based on the newspaper format commands stored in the
personal-news-profile.

In yet another aspect, the invention formats a hypermedia document into a personalized document. A
location of the hypermedia document is specified, a type of the hypermedia document is specified, a
scope of data to be retrieved from the hypermedia document is specified, wherein the scope is based on a
structure of the hypermedia document, and a format is specified for formatting the data retrieved from
the hypermedia document into the personalized document. The hypermedia document found at the specified
location is accessed, data is retrieved from the hypermedia document in accordance with the specified
hypermedia document type and in accordance with the specified scope, and the data is formatted into the
personalized document in accordance with the specified format.

In yet another aspect, the invention is a system for processing a hypermedia document. The system
accesses the hypermedia document, extracts addresses from the hypermedia document and stores the
addresses extracted from the hypermedia document in a container. The system activates a processing
function to process data stored at the addresses stored in the container, downloads the data stored at
the addresses stored in the container into a memory, and extracts predetermined data from downloaded
data in accordance with predetermined configuration information. The predetermined data is then
formatted in accordance with predefined formatting settings to generate a formatted document, and the
formatted document is processed in accordance with the processing function.

In preferred embodiments, the system inputs the formatting settings and configuration information via a
graphical user interface. The graphical user interface comprises plural processing icons, one of which
activates the processing function. By virtue of the graphical user interface, a user can interactively
set a document's format and change that format should a change be desired.

In particularly preferred embodiments, the graphical user interface is displayed in plural modes. The
plural modes comprise (1) a fully-functional mode in which the graphical user interface displays
formatting fields, processing options, menus and the processing icons, and (2) a minimizing mode in
which the graphical user interface displays only the processing icons. Typically, the graphical user
interface displayed in the minimizing mode is displayed during browsing of the hypermedia document. By
displaying the graphical user interface in plural modes, the present invention facilitates operation of
the invention during browsing of the hypermedia document.

This summary has been provided so that the nature of the invention may be understood quickly. A more
complete understanding of the invention can be obtained by reference to the following detailed
description of the preferred embodiments thereof in connection with the attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is a perspective view showing the outward appearance of the personal news retrieval system according to 
an embodiment of the invention. 

Figure 2 is a block diagram of the personal news retrieval system shown in Figure 1.

Figure 3, comprised of Figures 3A, 3B, 3C and 3D, shows representational diagrams illustrating an
example of the transformation of information from the Web (Figure 3A) to an extracted data tree (Figure
3B), then to a flattened document (Figure 3C), and finally to a formatted document (Figure 3D) according
to an embodiment of the invention.

Figure 4 is a representational block diagram of the manner by which a personal-news-profile for
retrieving news articles via the Web is created or edited according to an embodiment of the invention.

Figure 5, comprised of Figures 5A and 5B, shows flow diagrams describing how a personal-news-profile is
created or edited.

Figure 6 is a representational block diagram of the manner by which news articles are retrieved from the
Web and formatted with reference to a personal-news-profile according to an embodiment of the invention.

Figure 7 is a flow diagram describing how news articles are retrieved from the Web with reference to a
personal-news-profile.

Figure 8 is a flow diagram showing how retrieved news articles are formatted with reference to a
personal-news-profile and sent to a print device interface.

Figures 9A to 9E depict a graphical user interface used with the second embodiment of the present
invention.

Figure 10 is a flow diagram describing the operation of the second embodiment of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Figure 1 is a view showing the outward appearance of a representative embodiment of the invention. Shown
in Figure 1 is computing equipment 1, such as a Macintosh or an IBM PC or a PC-compatible computer,
having a windowing environment, such as Microsoft Windows. Provided with computing equipment 1 is
display screen 2, such as a color monitor or a monochromatic monitor, keyboard 3 for entering text data
and user commands, and a pointing device such as mouse 4 for pointing and for manipulating objects
displayed on display screen 2. Computing equipment 1 also includes a mass storage device such as disk
drive 5. Image data can be input into computing equipment 1 from a variety of sources such as a network
interface 11a or from external devices via facsimile/modem interface 6. Network interface 11a is used to
connect computing equipment 1 to a local area network (LAN) or to a wide area network (WAN) such as the
World Wide Web.

Figure 2 is a detailed block diagram showing the internal construction of computing equipment 1. As
shown in Figure 2, computing equipment 1 includes central processing unit (CPU) 8 interfaced with
computer bus 9. Also interfaced with computer bus 9 are printer interface 10, fax/modem interface 6,
display interface 11, network interface 11a, keyboard interface 12, mouse interface 13, main memory 14,
and disk drive 5.

Main memory 14 interfaces with computer bus 9 so as to provide random access memory storage for use by
CPU 8 when executing an application such as personal-news-profile editor 16 or Web printer 17. More
specifically, CPU 8 loads these software applications from disk drive 5 into main memory 14 and executes
the software applications out of main memory 14. In accordance with user instructions, stored
application programs are activated which permit processing and manipulation of data. Typically, the
software applications stored on disk drive 5, such as personal-news-profile editor 16, Web printer 17,
and HTML formatter 18, have been stored on disk drive 5 by downloading the software applications from a
computer-readable medium such as a floppy disk or CD ROM, or by downloading the software applications
from a computer bulletin board.

Disk drive 5 stores data files which can include text files and image files, in compressed or
uncompressed format, and stores software application files such as those noted above. The software
application files include Windows applications, DOS applications, and personal news retrieval files 15.
Personal news retrieval files 15 include personal-news-profile editor 16, Web printer 17, HTML formatter
18, personal-news-profile(s) 19, and site profile(s) 20. The detailed functions of personal news
retrieval files 15 will be discussed below, after a brief overview of the operation of the personal news
retrieval system.

Overview of Document Retrieval

Figure 3, comprised of Figures 3A to 3D, illustrates the operation of a representative embodiment of the
invention. Figure 3A is a graphical representation of a typical Web site 21 with news information
contained therein. Within Web site 21 is homepage 22 with links to indices such as headings 23, which
are in turn linked to articles 24. Some of articles 24 are linked to other articles. As article H 26
resides on another Web site, link 25 is a cross-site link. Link 25 illustrates how a single hypermedia
document, represented by homepage 22, can traverse multiple Web sites.

In order to retrieve news from Web site 21, the invention first traverses Web site 21 to retrieve data
according to user-defined rules. As will be discussed in more detail below, these rules can be based on
the structure of Web site 21, or on the structure of Web site 21 and its contents. The data is retrieved
into an extracted data tree, which preserves the organization of the data as shown in Figure 3B, but in
which some links are excluded.

The organization of extracted data tree 27 has several features. First, extracted data tree 27 has root
28 which can have child nodes for one or more sites 29, which in turn can have index nodes 30 which
correspond to indices/headings 23, article nodes 31, and the like. Second, extracted data tree 27 is a
true tree, with no loops (i.e., cyclic paths) therein. For example, Figure 3A shows a loop from homepage
22 to index node #1, to article C, and then back to homepage 22. This loop is removed when creating
extracted data tree 27.

Third, the organization of extracted data tree 27 depends on how the Web sites are traversed, and not on
the Web sites' actual layouts. Thus, article H 26 appears under index node #3 (under site #1),
indicating that the news retrieval system accessed article H 26 from site #1 via cross-site link 25.

Finally, as noted earlier, certain articles have been excluded from extracted data tree 27 due to the
structure of Web site 21 or possibly a content of indices/headings 23 and articles 24. For example,
articles E and G have been excluded from extracted data tree 27.
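
The tree properties described above (no loops, placement that follows the order of traversal, and
excluded articles) can be illustrated with a small Python sketch. The names Node, build_tree, links
and excluded are hypothetical, and the link table is only loosely modeled on Figure 3A; it is not a
transcription of that figure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def build_tree(
    start: str,
    links: Dict[str, List[str]],
    include: Callable[[str], bool] = lambda page: True,
) -> Node:
    """Traverse the link graph depth-first and return a loop-free tree."""
    seen = {start}  # pages already placed in the tree

    def expand(page: str) -> Node:
        node = Node(page)
        for target in links.get(page, []):
            # A page already in the tree is skipped, which removes loops;
            # a page rejected by the rule is skipped, which models exclusion.
            if target in seen or not include(target):
                continue
            seen.add(target)
            node.children.append(expand(target))
        return node

    return expand(start)

# Hypothetical link structure, including a loop back to the homepage.
links = {
    "homepage": ["index 1", "index 3"],
    "index 1": ["article C", "article E"],
    "article C": ["homepage"],
    "index 3": ["article G", "article H"],
}
excluded = {"article E", "article G"}
tree = build_tree("homepage", links, include=lambda page: page not in excluded)
```

Because a page is attached to whichever node reaches it first, the resulting tree reflects the
traversal order rather than the layout of the underlying sites.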

According to this embodiment of the invention, extracted data tree 27 is flattened into linear document
32, as shown in Figure 3C, possibly with reference to more exclusion rules. Linear document 32 is simply
a continuous document with information from extracted data tree 27 embedded therein.
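
One plausible way to flatten such a tree is a pre-order walk that emits each node in reading order.
The sketch below assumes a hypothetical Node type with a label, optional text, and children; the
source does not specify the traversal order or the record layout of the linear document.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)

def flatten(root: Node) -> List[Tuple[int, str, str]]:
    """Pre-order walk producing (depth, label, text) entries in reading order."""
    linear: List[Tuple[int, str, str]] = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        linear.append((depth, node.label, node.text))
        # Push children in reverse so the first child is popped (and emitted) first.
        for child in reversed(node.children):
            stack.append((child, depth + 1))
    return linear

# Hypothetical extracted data tree: root -> site -> index -> articles.
tree = Node("root", children=[
    Node("site 1", children=[
        Node("Business", children=[
            Node("article A", "Text A"),
            Node("article B", "Text B"),
        ]),
    ]),
])
linear_document = flatten(tree)
```

Keeping the depth with each entry lets a later formatting step distinguish site labels, headings
and articles without consulting the tree again.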

Finally, linear document 32 is formatted according to user-specified (or default) formatting
instructions into formatted document 33, shown as a stylized personal newspaper in Figure 3D. Formatted
document 33 has various fonts and/or colors for site labels, indices/headings, articles, and the like.
Furthermore, formatted document 33 is broken down into pages.
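
As a rough illustration of this last stage, the Python sketch below applies a per-kind style to each
entry of a linear document and then breaks the result into pages. The style strings, the entry kinds
("site", "heading", "article") and the lines_per_page parameter are assumptions made for the example;
real fonts and colors are replaced by plain-text markers.

```python
from typing import Dict, List, Tuple

def format_document(
    linear: List[Tuple[str, str]],
    styles: Dict[str, str],
    lines_per_page: int = 4,
) -> List[List[str]]:
    """Render each (kind, text) entry with its style, then split into pages."""
    rendered = []
    for kind, text in linear:
        # Fall back to the article style for unknown kinds.
        template = styles.get(kind, styles["article"])
        rendered.append(template.format(text=text))
    return [
        rendered[i:i + lines_per_page]
        for i in range(0, len(rendered), lines_per_page)
    ]

# Hypothetical styles standing in for fonts and colors.
styles = {"site": "== {text} ==", "heading": "* {text}", "article": "    {text}"}
linear = [
    ("site", "Site 1"),
    ("heading", "Business"),
    ("article", "Story A"),
    ("article", "Story B"),
    ("heading", "Sports"),
    ("article", "Story C"),
]
pages = format_document(linear, styles, lines_per_page=4)
```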

Note that in alternate embodiments of the news retrieval system, certain stages of the above
transformation from Web site 21 to formatted document 33 can be skipped. For example, data from Web site
21 can be retrieved directly into flattened document 32, as long as a record of the organization of the
data is maintained (possibly in a separate linked list) so as to avoid downloading the same article
twice and so as to avoid loops in the organization of Web site 21. Alternatively, extracted data tree 27
can be directly formatted into formatted document 33. In any case, the basic operation of the invention
remains the same: the news retrieval system traverses a hypermedia document on the Web, extracts data
according to user-defined information, and formats the data into a personalized newspaper.

As mentioned in the above discussion, various user-defined rules and other information (such as
formatting information) are involved in the news retrieval process. That user defined information is
stored in personal-news-profile(s) 19, the definition of which is described next.

Defining a Personal-News-Profile

Figures 4 and 5 illustrate the process by which personal-news-profile 19 is defined. To create
personal-news-profile 19, personal-news-profile editor 16 communicates with personal-news-profile 19,
site profile 20, and Web reader 34.

Personal-news-profile 19 contains information as to what sites to access for creating a personalized
newspaper, what sections to retrieve from those sites, rules to be used to determine what data to
extract from the sections and the articles therein, rules to determine how to exclude links, and
newspaper format information. A sample personal-news-profile is shown in Appendix 1.

Site profile 20 includes general site information that is not specific to a particular user. For
example, site profile 20 could contain information such as full site addresses, sections within a site,
non-user specific passwords, etc. Sample site profiles are shown in Appendix 1. Because general site
information is stored in site profile 20, personal-news-profile 19 can refer to the general site
information with reference to site profile 20, saving space in the personal-news-profile. For example,
as shown in Appendix 1, personal-news-profile 19 can refer to a site number 1. Site profile 20 indicates
that site number 1 is the "San Jose Mercury News," with a homepage at "http://www.sjmercury.com/". This
construction also centralizes general site information. Thus, if a site address changes, only site
profile 20 needs to be changed to update all personal-news-profiles 19 on the system.
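
The indirection between a personal-news-profile and the site profile can be sketched in Python as
follows. The class names, fields, and example addresses are hypothetical and are not taken from
Appendix 1; the point is only that user profiles store a site number, while the address lives in one
shared place.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class SiteProfile:
    name: str
    homepage: str  # full site address, shared by all users

@dataclass
class ProfileEntry:
    site_number: int     # reference into the site profile, not an address
    sections: List[str]  # sections this user wants from the site

def resolve(
    entries: List[ProfileEntry], sites: Dict[int, SiteProfile]
) -> List[Tuple[str, str]]:
    """Expand a user's entries into (homepage, section) pairs via the site profile."""
    return [
        (sites[entry.site_number].homepage, section)
        for entry in entries
        for section in entry.sections
    ]

# Hypothetical site profile and two users' personal-news-profiles.
sites = {1: SiteProfile("Example Daily News", "http://news.example/")}
alice = [ProfileEntry(1, ["front page"])]
bob = [ProfileEntry(1, ["business", "sports"])]

# A change of address is made once, in the site profile only.
sites[1].homepage = "http://news2.example/"
alice_targets = resolve(alice, sites)
bob_targets = resolve(bob, sites)
```

Both users pick up the new address on their next retrieval without any change to their own profiles.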

Web reader 34 is an application program or program module that communicates with the Web via Web sender
35. In response to commands from personal-news-profile editor 16, Web reader 34 will access the Web,
traverse hypermedia documents on the Web, retrieve data from the documents, and return the retrieved
data to personal-news-profile editor 16.

As shown in Figure 4, personal-news-profile editor 16 includes four modules: site driver 36, Web reader
interface 37, profile manager 38, and format editor 39.

Web reader interface 37 interfaces personal-news-profile editor 16 to Web reader 34. Site driver 36
interacts with Web reader 34 via Web reader interface 37 to provide an abstract interface to each
individual Web site. More specifically, site driver 36 instructs Web reader 34 to access various Web
sites and to retrieve data from those sites. Thereafter, site driver 36 receives that data and builds
site profile 20 therefrom. The data can also be used to update an existing site profile.

In building site profile 20, site driver 36 translates the structure of each accessed Web site to a
uniform structure defined in site profile 20, and stores data retrieved therefrom in site profile 20. By
translating different Web sites, some of which may have different structures, into a single uniform
structure and storing data therefrom in that structure in site profile 20, the present invention
facilitates access to information from different Web sites, and thus reduces overall processing time.

Profile manager 38 maintains document templates that specify how to format a personalized newspaper.
Predefined document templates exist. In addition, format editor 39 allows a user to specify personalized
templates for formatting a newspaper, either by editing existing templates or by creating new ones. In
any case, each document template specifies page layout information, font information, style information,
colors, etc. for the titles, indices/headings, subheadings, text and the like for a personalized
newspaper.

Sample code for personal-news-profile editor 16, site driver 36, and profile manager 38 is included in Appendix 3A.

Figures 5A and 5B are flow diagrams describing the operation of personal-news-profile editor 16 in more
detail. Figure 5A shows the operation of personal-news-profile editor 16 in defining the parts of
personal-news-profile 19 relating to accessing Web sites and retrieving data from those sites.

In step S500 of Figure 5A, personal-news-profile editor 16 is launched by a user. In step S501, the
editor launches Web reader 34. The user's personal I.D. is then retrieved in step S502. If a
personal-news-profile already exists for that I.D., step S503 directs flow to step S504, where the user
is given the option of skipping to the format editor. Otherwise, personal-news-profile editor 16 enters
a "learning mode" in step S505. Once in the learning mode, personal-news-profile editor 16 proceeds to
step S506, where it accepts a Web command (i.e., a command to traverse a hypermedia link) from the user
and forwards the Web command to the Web reader by means of site driver 36. Site driver 36 maintains a
hierarchical log of Web sites visited by Web reader 34. In step S507, personal-news-profile editor 16
creates an extraction rule from the Web command. This rule will allow the news retrieval system to later
duplicate the user's selection criteria in browsing (clicking on hyperlinks within) a Web site.

The rule specifies, at the least, structural criteria for duplicating the traversal of the Web site. For
example, if a user accesses all articles under a particular index/heading, the rule will specify that
all articles under that index/heading should be retrieved.

In one embodiment of the invention, the rule can also include content-based criteria (i.e.,
keyword-based criteria) accepted from the user. These content-based rules can, for example: (1) require
certain words to be in an article, (2) exclude articles with certain words, (3) require certain boolean
combinations of words, (4) rank articles that are selected based on structural criteria, with the
ranking based on keywords, and then require the selection of the articles with the highest ranking(s),
or (5) exclude certain types of articles such as advertisements.
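
A few of these content-based criteria (required words, excluded words, and keyword ranking) can be
sketched in Python as below. This is only an illustration under simple assumptions: the function
select_articles, its parameters, the whole-word matching, and the one-point-per-keyword score are
hypothetical and do not reproduce the rule syntax of Appendix 2.

```python
import re
from typing import Iterable, List, Optional, Set, Tuple

def words(text: str) -> Set[str]:
    """Lower-cased set of words appearing in the text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def select_articles(
    articles: Iterable[Tuple[str, str]],
    required: Iterable[str] = (),
    excluded: Iterable[str] = (),
    rank_keywords: Iterable[str] = (),
    top_n: Optional[int] = None,
) -> List[str]:
    """Return titles of articles passing the word rules, best-ranked first."""
    required, excluded = set(required), set(excluded)
    rank_keywords = list(rank_keywords)
    kept = []
    for title, text in articles:
        found = words(text)
        if not required <= found:   # every required word must be present
            continue
        if excluded & found:        # no excluded word may be present
            continue
        score = sum(1 for keyword in rank_keywords if keyword in found)
        kept.append((score, title))
    kept.sort(key=lambda pair: -pair[0])  # stable: ties keep source order
    titles = [title for _, title in kept]
    return titles if top_n is None else titles[:top_n]

# Hypothetical articles as (title, text) pairs.
articles = [
    ("Chip maker posts profit", "The chip maker reported record profit and revenue."),
    ("Sale on chips", "Advertisement: chip sale this weekend only."),
    ("Chip plant opens", "A new chip plant opened today."),
]
best = select_articles(
    articles, required=["chip"], excluded=["advertisement"],
    rank_keywords=["profit", "revenue"], top_n=1,
)
all_kept = select_articles(
    articles, required=["chip"], excluded=["advertisement"],
    rank_keywords=["profit", "revenue"],
)
```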

Examples of the syntax for the structural and content-based exclusion rules are shown in Appendix 2.
Several different types of rules are shown. Some simply limit the traversal of a Web site to a certain
number of links. Others are date and keyword based exclusion rules. One particularly flexible rule
indicates that articles should be ranked based on a keyword analysis and the top scoring articles should
be chosen. Other rules include "flattening" rules. These rules control the flattening of the extracted
data tree, as will be explained in more detail below.

At the least, the rule includes structural information about the user's selection (i.e., first page,
first document, all links, etc.), necessary password information, browser commands, and the like. The
rule can also include a pointer or a reference to site profile 20 and the appropriate information
therein. General (non-user specific) information is used by site driver 36 to maintain site profile 20.
In this manner, address information and passwords common to multiple users can be maintained in site
profile 20, as discussed above. For example, site driver 36 will store commands or hyperlinks to other
documents in a Web page in the rule, but will not store a Web site's full address in the rule. That
address information is stored in site profile 20.

In step S508, rule data defining the rule created from a Web command(s) is stored in an extracted data
tree such as extracted data tree 27 in Figure 3B. This data tree is a linked list that reflects the
organization of the data retrieved from the Web. In step S509, flow returns to step S506 for the next
Web command unless the user is done (i.e., the user signs off the Web site), in which case flow proceeds
to step S510.

At this point the creation of the personal-news-profile has proceeded much like the creation of a macro common
to word processing programs, except that site profile 20 has been used to minimize storage requirements and to
centralize general site information. In order to minimize storage requirements further and in order to make the news
retrieval system more flexible and efficient, the extracted rules are now compiled to remove redundant links, multiple
visits to the same site, and the like. This occurs in step S510, and the resulting compiled rules become the first part of
personal-news-profile 19.
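
As a rough illustration of the compilation in step S510, the following Python sketch collapses repeated visits to a site and drops repeated links. The representation of recorded rules as (site, link) pairs and the name compile_rules are assumptions made for this example only; the description does not specify the compiled format.

```python
from typing import Dict, Iterable, List, Tuple


def compile_rules(recorded: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group recorded (site, link) steps by site and drop duplicate links.

    Each site appears once, in the order it was first visited, and each
    link appears once per site, in the order it was first followed.
    """
    compiled: Dict[str, List[str]] = {}
    for site, link in recorded:
        links = compiled.setdefault(site, [])  # one entry per site
        if link not in links:                  # skip redundant links
            links.append(link)
    return compiled
```

Keying the result by site mirrors the idea that a site is visited once and all of its wanted links are followed during that visit.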

Alternatively, personal-news-profile editor 16 may be invoked as a graphical user interface which allows a user to
edit a previously stored personal-news-profile or to specify document composition preferences, for example, by
specifying news sites, headline articles only, keywords, etc. In either case, the result is personal-news-profile 19, which
comprises a listing of Web site pointers as well as extracted rules for traversing through a Web site or sites.

For a better understanding of the above, sample personal-news-profiles and sample site profiles are provided in
Appendix 1 as noted above.

Next, operation proceeds to give the user an option to modify a custom newspaper template, as shown in Figure
5B. In step S511, it is determined if a newspaper template has been defined and stored in personal-news-profile 19. If
a newspaper template has been defined, step S512 gives the user the option to edit the template or to proceed to step
S520. If the user chooses to edit the template or if no newspaper template has been defined, flow proceeds to step
S513.

Step S513 gives the user the option of creating a custom template or using a predefined template. If the user wants
to use a predefined template, step S514 gets the specified predefined template, which is added to the
personal-news-profile in step S519. Otherwise, flow proceeds to step S515, where format editor 39 is invoked.

Format editor 39 has a graphical user interface that provides the user with a number of formatting options. In step
S516, format editor 39 allows the user to define which newspaper sections are to be printed in the newspaper, which
Web site's news articles are to be placed in each section, and/or how each page is to be laid out. In this regard, the user
can specify which Web site's news articles are to be used as a front page, which Web site's news articles are to be used
as a business page, which Web site's news articles are to be used as a sports page, etc. In addition, in step S516, the
user can define where each index/heading should be listed, as well as what sub-headings should go on each page.

In step S517, format editor 39 allows the user to define the font styles for indices/headings, sub-headings, bylines
and actual text of news articles. In step S518, format editor 39 prompts the user to define index/heading colors, title
colors, etc. In this regard, format editor 39 is capable of determining the types of fonts and colors available to the user
based on the system's printer capabilities.

Once all of the information is gathered for the custom template, the format editor adds the information to
personal-news-profile 19 in step S519. Alternatively, profile manager 38 may also store the custom format as a template in a
common area for use by other users. In this case, only a pointer or reference to the custom template is stored in
personal-news-profile 19.

In step S520, personal-news-profile editor 16 prompts the user to set an automatic newspaper delivery time and
method (i.e., print or store on disk drive 5 for later printing). These settings are added to personal-news-profile 19. More
specifically, in the case that a user's computer is continuously supplied with power, the Web news retrieval system can
be launched automatically at a designated time. The system will retrieve articles from the Web sites which are listed in
personal-news-profile 19. Upon retrieving the news articles, the articles will be formatted based on the newspaper
template in personal-news-profile 19. The formatted personalized newspaper can then be either printed or stored for later
viewing. In the case that a time is not set for newspaper delivery, the user can execute the Web news retrieval system
program at any time.

Once personal-news-profile 19 has been created, the Web news retrieval system, upon being launched, can
traverse Web news sites and build a personalized newspaper by automatically retrieving various news articles from the
Web news sites and printing the news articles based on the newspaper template indicated in personal-news-profile 19.
How the Web news retrieval system of the invention performs this function is described next.

Retrieving a Document Using a Personal-News-Profile

Figure 6 is a representational block diagram of the manner by which the invention retrieves articles from the Web
according to personal-news-profile 19. (Figure 6 also shows the manner by which the retrieved articles are flattened into
a linear document and formatted. These functions are discussed in greater detail in the next section of this application.)

As shown in Figure 6, Web printer 17 is responsible for retrieving news articles. Web printer 17 is an end-user
application that communicates with personal-news-profile(s) 19, site profile 20, Web reader 34, and output interface 40 in
order to perform this function.

Web printer 17 looks at personal-news-profile 19 to determine which Web sites to access and which data to retrieve
from those sites. Web printer 17 also looks at site profile 20 for general site information. According to the information in
personal-news-profile 19 and site profile 20, Web printer 17 instructs Web reader 34 to connect to the Web via Web
sender 35 in order to access various Web sites and to retrieve data from those sites. Web reader 34 sends the retrieved
data to Web printer 17, and Web printer 17 uses the data to build an extracted data tree. As will be discussed in greater
detail in the next section of the application, Web printer 17 then flattens the extracted data tree into a linear document
and formats the linear document for output via output interface 40.

As shown in Figure 6, Web printer 17 includes four program modules: Web reader interface 50, site driver 51, tree
manager 41, and formatter 42.

Web reader interface 50, like Web reader interface 37 described above, interfaces Web printer 17 to Web reader
34.

Site driver 51 accesses site profile 20 and personal-news-profile 19 and provides data stored therein to Web reader
34. As noted above, Web reader 34 uses that data to access various Web sites and to extract data therefrom. This
retrieved data is used by Web printer 17 to build an extracted data tree.

Tree manager 41 manages the extracted data tree. In this regard, tree manager 41 keeps track of the organization
of the retrieved data in the extracted data tree. This allows Web printer 17 to avoid accessing the same article twice, to
avoid unnecessarily re-visiting a Web site, and to avoid getting caught in a cycle (loop) in the organization of a
hypermedia document on the Web. Alternatively, tree manager 41 could store the data in blocks (as opposed to directly in a
data tree) with reference to a linked list that provides the same functionality as the extracted data tree. Sample code for
tree manager 41 is included in Appendix 3B.
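
The cycle avoidance described above can be sketched in Python as follows. This is not the sample code of Appendix 3B; it is an illustration that assumes a caller-supplied function links_of (hypothetical) returning the addresses linked from a page, and it represents each tree node as a plain dictionary.

```python
from typing import Callable, Dict, Iterable


def build_tree(start: str, links_of: Callable[[str], Iterable[str]]) -> Dict:
    """Build a tree of reachable pages, visiting each address at most once."""
    visited = {start}                      # addresses already placed in the tree
    root = {"address": start, "children": []}
    pending = [root]                       # nodes whose links are not yet expanded
    while pending:
        node = pending.pop()
        for target in links_of(node["address"]):
            if target in visited:          # already retrieved: skip, which breaks cycles
                continue
            visited.add(target)
            child = {"address": target, "children": []}
            node["children"].append(child)
            pending.append(child)
    return root
```

Because every address enters the visited set before it is expanded, a link back to an earlier page is ignored and the traversal always terminates on a finite set of pages.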

Formatter 42 is responsible for flattening the extracted data tree into a linear document and formatting the linear
document into a personalized newspaper. Formatter 42 performs these functions in accordance with the print criteria
and format information (i.e., newspaper template) indicated in personal-news-profile 19. Sample code for formatter 42
is included in Appendix 3B.

In more detail, Figure 7 is a flow diagram describing how Web printer 17 uses Web reader 34 to traverse the Web
according to personal-news-profile 19 and to retrieve articles from the Web according to the profile, excluding unwanted
data.

The Web printer starts in step S700. In step S701, Web printer 17 retrieves either a user designated
personal-news-profile or a default personal-news-profile stored in disk drive 5 using site driver 51. In this regard, because
computer equipment 1 may be used by more than one user, there may be one or more personal-news-profiles stored on the
equipment, one of which will be designated as the default. Upon retrieving the designated personal-news-profile, in step
S702 Web printer 17 determines whether any news data has been previously stored to disk drive 5 (for example, by a
previous automatic news delivery) or if news articles should be retrieved using personal-news-profile 19.

In the case that news data does exist on disk drive 5, in step S703 the stored news data is retrieved and flow
proceeds to step S801 of Figure 8, discussed in more detail in the next section. On the other hand, if no stored news data
exists, Web printer 17 invokes Web reader 34 in step S704. Note that this is the same Web reader 34 as discussed
above with respect to defining a personal-news-profile.

Upon being invoked, Web reader 34 connects to Web sender 35 in step S705, which provides a connection to a
network, such as the World Wide Web. Web printer 17 then provides Web reader 34 with an address for the first Web site
to be visited based on information retrieved from personal-news-profile 19. Once connected to the desired Web site in
step S706, Web printer 17 provides Web reader 34 with commands/links for traversing the Web to the next Web page
containing information that personal-news-profile 19 indicates should be retrieved. Web reader 34 traverses the Web
according to this information in step S707.

In step S708, Web reader 34 retrieves the desired information and sends it to Web printer 17 according to the rules
in personal-news-profile 19. Thus, data exclusion occurs in this step. The rules in personal-news-profile 19 specify
structural and content-based criteria for excluding data from the personalized newspaper. The structural rules limit the
retrieved information on the basis of the structure of the Web site accessed by Web reader 34. The content-based rules
limit the retrieved information on the basis of its content. As mentioned above with respect to creating a
personal-news-profile, examples of the syntax of the retrieval rules in personal-news-profile 19 are included in Appendix 2.

In addition to rule-based exclusion, media-type exclusion occurs in step S708, wherein data of a media type that
cannot be printed is excluded from the extracted data tree. For example, movie and sound data can be excluded.
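
A minimal Python sketch of the media-type exclusion is given below. It assumes, for illustration only, that each retrieved item carries a MIME-style media type string and that text and image data are the printable types; the description itself names only movie and sound data as examples of excluded data.

```python
from typing import List, Tuple

# Assumption for this sketch: only these major media types can be printed.
PRINTABLE_MAJOR_TYPES = {"text", "image"}


def is_printable(media_type: str) -> bool:
    """Return True if the major part of a type such as 'text/html' is printable."""
    major = media_type.split("/", 1)[0].strip().lower()
    return major in PRINTABLE_MAJOR_TYPES


def exclude_unprintable(items: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the (name, media_type) items that can be printed."""
    return [item for item in items if is_printable(item[1])]
```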

Web printer 17 stores the retrieved data in disk drive 5 (or in main memory 14) in the extracted data tree managed
by tree manager 41. Alternatively, the data could be stored in blocks with reference to a linked list as discussed earlier.
In step S709, Web printer 17 returns to step S707 to complete retrieving all information from Web pages at the Web
site. In step S710, upon completing a traversal of one Web site, Web printer 17 uses tree manager 41 to compare the
sites remaining in personal-news-profile 19 with the site organization information in the extracted data tree to determine
if more sites need to be visited. In the case that more Web sites need to be visited, step S710 returns flow to step S706
and news articles are retrieved in the same manner as discussed above. On the other hand, if all of the Web sites listed
in personal-news-profile 19 have been visited and all of the articles retrieved, flow proceeds to step S801 in Figure 8.

Flattening and Formatting the Retrieved Data 

Figure 8 is a flow diagram showing how the extracted data tree is flattened and formatted. The configuration of the
invention is the same as when retrieving data from the Web (shown in Figure 6). In fact, the flattening and formatting
processes can occur, at least to a limited extent, concurrently with the data retrieval process.

In step S801 of Figure 8, the extracted data tree is flattened. This simply means that the organization of the data is
converted from an extracted data tree to a linear document. This step provides the opportunity for excluding more data
from the personalized newspaper, for example by including only certain nodes of the data tree in the flattened document.
This exclusion process is controlled by the flattening rules in personal-news-profile 19.
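
Flattening can be pictured as a depth-first walk that writes the tree's nodes out in order. The Python sketch below is an illustration under stated assumptions: nodes are dictionaries with "text" and "children" keys, and a flattening rule is modelled as a hypothetical keep predicate; the actual flattening rule syntax is that of Appendix 2.

```python
from typing import Callable, Dict, List


def flatten(node: Dict, keep: Callable[[Dict], bool] = lambda n: True) -> List[str]:
    """Convert a tree into a linear list of texts in depth-first, pre-order.

    A node whose keep(node) is False is left out of the linear document,
    but its children are still visited.
    """
    linear: List[str] = []
    if keep(node):
        linear.append(node["text"])
    for child in node.get("children", []):
        linear.extend(flatten(child, keep))
    return linear
```

Passing a predicate such as `lambda n: not n.get("children")` keeps only leaf nodes, which is one way a rule could drop index pages while retaining the articles they link to.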

After the data is flattened into a linear document, the data is formatted in step S802 according to the template
indicated in personal-news-profile 19. The definition of this template, which is either a pre-defined template or a custom
template, was discussed earlier. Finally, in step S803, the formatted and fully personalized newspaper is sent to output
interface 40. This interface could be printer interface 10 to printer 7, display interface 11 to display 2, or even modem/fax
interface 6.

Second Embodiment: The HTML Formatter

The second embodiment of the invention is a system for processing a hypermedia document. The system
accesses the hypermedia document, extracts addresses from the hypermedia document, and stores the addresses
extracted from the hypermedia document in a container. The system activates a processing function to process data
stored at the addresses stored in the container, downloads the data stored at the addresses stored in the container into
a memory, and extracts predetermined data from downloaded data in accordance with predetermined configuration
information. The predetermined data is then formatted in accordance with predefined formatting settings to generate a
formatted document, and the formatted document is processed in accordance with the processing function.
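
The first stage described above, extracting addresses from a hypermedia document into a container, can be sketched with Python's standard html.parser and urllib.parse modules. The class name LinkExtractor, the function extract_addresses and the use of a plain list as the container are assumptions for illustration; the downloading and formatting stages are not shown.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the target address of every anchor tag, in document order."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.container: List[str] = []   # stands in for the container of addresses

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative addresses against the address of the document.
            self.container.append(urljoin(self.base_url, href))


def extract_addresses(html: str, base_url: str) -> List[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.container
```

Keeping the addresses in document order matches the later statement that addresses are processed in the order in which they are stored in the container.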

The second embodiment of the invention is depicted as HTML formatter 18, noted in Figure 2. An example of HTML
formatter 18 is WebFormatter, manufactured by Canon Information Systems, Inc. The second embodiment will be
described with respect to WebFormatter. It should be noted, however, that HTML formatter 18 is not limited to the
WebFormatter embodiment, and that various alternative embodiments within the spirit and scope of the following description
are possible.

WebFormatter is stand-alone utility software that can be used in conjunction with different Web browsers, such as
Netscape, Mosaic and Internet Explorer. In short, WebFormatter extracts data from a Web page, strips out
extemporaneous data from the extracted data, and reformats the data into a formatted document. The formatted document can
then be printed, stored in an RTF (Rich Text Format) file, or edited in any RTF compatible editor, such as MS Word,
WordPerfect, Wordpad, etc.

WebFormatter can be activated from a windowing environment such as Microsoft Windows®. From such a
windowing environment, WebFormatter can be activated by double-clicking on a WebFormatter icon (not shown) in a
start-up window, selecting WebFormatter from the Windows start menu, dragging a URL (uniform resource locator) icon (not
shown) from a Web browser and dropping it into the WebFormatter icon, or by automatically invoking WebFormatter
when the Web browser is started.

Unlike the first embodiment of the invention described above, WebFormatter does not use a predefined
personal-news-profile to specify criteria for creating a particular type of document from one or more Web pages. Rather,
WebFormatter relies upon user-specified criteria to create a particular type of document, such as a newspaper or the like,
from one or more Web pages. These criteria are input interactively by a user via a graphical user interface.

As described in more detail below, WebFormatter operates in two modes - a minimized mode and a fully-functional
mode. In the minimized mode, WebFormatter's graphical user interface is essentially a floating print button, which is
displayed concurrently with displayed Web pages. By virtue of this feature, as a user explores the Web, the user can
process, format, and print out Web pages by merely clicking on the floating print button.

In its fully-functional mode, WebFormatter's graphical user interface provides spaces for a user to enter a URL
address of a Web page to be processed, enter a personal title for the document, select a format for the document,
preview a formatted first page of the document, and either print the document, save the document as an RTF file, or
view/edit the document using an RTF editor. The graphical user interface for the fully-functional mode will be described
first, since it is from that interface that the user can enter the minimized mode.

Figure 9A shows graphical user interface 43 for WebFormatter's fully-functional mode. Graphical user interface 43
is displayed on display 2 upon first activation of WebFormatter. As with any interactive windowing software application,
a user interacts with graphical user interface 43 by means of mouse 4 (by pointing and clicking) and keyboard 3.

As shown in Figure 9A, graphical user interface 43 includes fields 44 and 46 to 49, through which a user can specify
the URL address of a document to be formatted and the format of that document. Beginning with URL field 44, a user
enters the URL address (e.g., http://www.ds.canon.com/!is/tis_home.htm) of a Web page to be processed by
WebFormatter. There are several different ways for the user to enter the URL address. The user can (1) type the address
directly into URL field 44, (2) copy the URL address in the Web browser and paste the URL address into URL field 44,
(3) drag the URL address from the Web browser onto graphical user interface 43 or onto the WebFormatter icon, or (4)
click on Current URL button 54.

With regard to Current URL button 54, if a user clicks on Current URL button 54, WebFormatter locates the active
Web browser and queries the Web browser for the address of the current Web page. Thereafter, the Web browser
provides the address of the current Web page to WebFormatter, which places the address in URL address field 44. If Current
URL button 54 is activated and no Web browser is currently running, WebFormatter displays dialog box 56, shown in Figure
9A.

As shown, dialog box 56 includes Cancel button 57 and Launch Browser button 59. Cancel button 57 cancels a
user's request to input a URL address into URL address field 44 via Current URL button 54. Launch Browser button 59,
on the other hand, launches a Web browser specified in WebFormatter. As noted below, WebFormatter is configured
beforehand with predefined information including a Web browser to be used with WebFormatter. Configuration of
WebFormatter will be described in more detail below.

In alternative embodiments of WebFormatter, a filename can also be entered into URL address field 44. For
example, in these alternative embodiments, if a user wishes to format a hyper-linked manual into a book-like format, the user
enters the filename into URL address field 44. Thereafter, WebFormatter proceeds through the file in the same manner
as through specified Web pages in order to reformat the hyper-linked manual as desired.

Returning to graphical user interface 43, title field 46 enables a user to enter a personalized title for a formatted
document. The title may be typed directly or pasted into title field 46.

Formatting fields 47 to 49 define the format of a document to be output by WebFormatter. Options for the different
formatting fields can be accessed by clicking on a scroll bar, such as scroll bar 55, of a respective formatting field. Each
of these fields is described in detail below.

Styles field 47 provides four options for formatting an output document. These styles relate to characteristics of an
output document such as size of headers, margins, etc. The style options include Contemporary, Formal, Fun and
Professional. The invention, of course, is not limited to these four style options, and other styles can be added as desired.

Columns field 48 defines the number of columns in a formatted output document. Two columns options are
available - Single and Multiple; however, the invention is not limited to these two options. The Single option, as might be
expected, formats the document into a single column. The Multiple option, on the other hand, formats the document into
a predetermined number of columns. In preferred embodiments of the invention, the Multiple option is set to two
columns; however, any number can be set.

Spacing field 49 defines the spacing between lines in a formatted output document. Three options are provided in
WebFormatter, but other options can be added as desired. These three options are Condensed, Normal and Easy To
Read, with Condensed giving the least spacing between lines and Easy To Read giving the most spacing
between lines.

Graphical user interface 43 is also provided with Preview button 60. By clicking on Preview button 60, a user can
preview a first page of a formatted document in viewing area 61. An example of a previewed formatted document is
shown in Figure 9A.

As shown in Figures 9A and 9B, WebFormatter also includes Options button 61. Options button 61 provides a user
with additional formatting options which are used by WebFormatter to create a formatted document. A user can activate
Options button 61 by clicking thereon. This causes Options dialog box 62, shown in Figure 9B, to appear on display 2.

As shown in Figure 9B, Options dialog box 62 includes General options 64, Container options 66 and Strip Meta
Info options 67. General options 64 includes "Text only" listbox 72, "Index of links in the page" listbox 73, and "No
floating pictures" listbox 74. These options are indicated as being selected by a check mark or the like in a respective listbox.
As will become clear from their descriptions, more than one of the options in General options 64 can be selected at the
same time.

"Text only" listbox 72 instructs WebFormatter to strip all graphics in a Web page and print only text therein. "Index
of links in the page" listbox 73 instructs WebFormatter to add a list of all URLs present in a Web page or pages to the
end of a formatted document. Preferably, the list of URLs is printed as superscript, and anchor positions of the URLs in
the list are marked in bold. "No floating pictures" listbox 74 instructs WebFormatter to print all images in the document
in a particular area of the formatted document. In some cases, therefore, when this option is selected, WebFormatter
shrinks images, as needed, so that images fit into a particular area.

Strip Meta Info options 67 provides engineering options which facilitate stripping of unnecessary information from
a Web page being processed by WebFormatter. The options include (1) "None", which instructs WebFormatter to strip
nothing from the Web page, (2) "Till the first horizontal rule", which instructs WebFormatter to strip all links and images
until and up to predefined first and second horizontal formatting rules (e.g., up until a horizontal line across a page), and
(3) "Till the first text", which instructs WebFormatter to strip all links and images up to first and last occurrences of text
in the Web page. Only one of Strip Meta Info options 67 can be selected at a time. Selection thereof is indicated by a
dot in a bullet located next to an option, as shown in Figure 9B.
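
As one possible reading of the "Till the first horizontal rule" option, the Python sketch below removes links and images that precede the first horizontal rule tag in a page and leaves the rest untouched. It is a simplification, not WebFormatter's method: it handles only a leading region, uses regular expressions rather than a full HTML parser, and the function name strip_until_first_rule is hypothetical.

```python
import re

# Simplified patterns; real-world HTML may need a proper parser.
_LINK = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
_IMAGE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
_RULE = re.compile(r"<hr\b[^>]*>", re.IGNORECASE)


def strip_until_first_rule(html: str) -> str:
    """Remove links and images that appear before the first horizontal rule."""
    match = _RULE.search(html)
    if match is None:
        return html                      # no horizontal rule: strip nothing
    head, tail = html[: match.start()], html[match.start():]
    head = _IMAGE.sub("", _LINK.sub("", head))
    return head + tail
```

The "None" option corresponds to returning the page unchanged, and the "Till the first text" option could follow the same pattern with a different boundary test.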

Container options 66 provides options for processing documents, addresses for which are stored in container 76
shown in Figure 9B. Prior to describing Container options 66, a description of container 76 will be provided.

As noted, container 76 stores URL addresses of selected documents. Document addresses which are input to field
44 are added to container 76. The order in which URLs are input into container 76 denotes the order in which data in
the URLs is processed by WebFormatter. As shown in Figure 9B, once container 76 becomes full, its icon changes to
that shown by reference numeral 77.

When a user clicks on the icon for container 76, menu 77 is displayed. Menu 77 provides five options; i.e., Open
79, Empty 80, Print 81, Edit 82 and Save 84. These options are highlighted when activated, and are described in detail
below.

Open 79, when activated, displays Container Contents screen 87 shown in Figure 9B. Container Contents screen
87 shows the URL addresses stored in container 76. Container Contents screen 87 provides four buttons; i.e., Add
current URL button 88 which adds the current URL to container 76, Delete button 89 which permits a user to highlight and
delete a URL in container 76, Empty button 90 which permits a user to empty container 76, and Done button 91 which
permits a user to close Container Contents screen 87. It is noted that a user can also empty the contents of container
76 by clicking on Empty 80 of menu 77.

In addition, the user can rearrange the order of URLs stored in container 76 by dragging and dropping different
URLs at different locations therein. As noted above, since the URLs are processed in the order that they appear in
container 76, this feature permits a user to rearrange the processing order of the URLs in container 76 interactively.

Print 81, Edit 82 and Save 84, when activated, cause WebFormatter to download all data at Web pages defined by
the URLs stored in container 76, format them as specified by the user, create RTF file(s) storing the formatted Web
pages, and do the selected action, i.e., save, edit or print the RTF file(s). This process is described in greater detail
below.

Referring back to Options dialog box 62, Container options 66 include "Print table of contents" listbox 92 and
"Empty after processing" listbox 94. As shown, a check mark appears in a listbox to indicate that the listbox has been
selected. In this regard, more than one listbox can be selected at a time. "Print table of contents" listbox 92, when
selected, instructs WebFormatter to print titles of all URLs in container 76 as a table of contents in a formatted output
document. "Empty after processing" listbox 94, when activated, instructs WebFormatter automatically to empty
container 76 after printing, editing or saving a document without waiting for a user to do so.
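
The behaviour of the container described above (ordered storage, interactive reordering, processing in stored order, and optional emptying after processing) can be summarized in a short Python sketch. The class and method names are hypothetical, and the action argument stands in for the download, format and print/edit/save steps.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


class Container:
    """Ordered store of URL addresses, processed in the order stored."""

    def __init__(self) -> None:
        self.urls: List[str] = []

    def add(self, url: str) -> None:
        self.urls.append(url)              # new addresses go to the end

    def move(self, old_index: int, new_index: int) -> None:
        """Reorder, as a drag-and-drop within the contents screen would."""
        self.urls.insert(new_index, self.urls.pop(old_index))

    def empty(self) -> None:
        self.urls.clear()

    def process(self, action: Callable[[str], T],
                empty_after_processing: bool = False) -> List[T]:
        """Apply action to each URL in order; optionally empty afterwards."""
        results = [action(url) for url in self.urls]
        if empty_after_processing:
            self.empty()
        return results
```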

Also shown as part of Container options 62 are Select RTF Editor button 69, Cancel button 70 and OK button 71.
By clicking on Select RTF Editor button 69, a user can select an RTF file editor, examples of which are noted above.
This can be done, for example, by displaying another dialog box listing predefined RTF editors (not shown) and
selecting one of the predefined RTF editors. Cancel button 70 cancels Container options 62 and OK button 71 confirms
selected options in Container options 62 and then closes its dialog box.

As shown in Figure 9B, graphical user interface 43 also includes print icon 96, edit icon 97, save icon 99, help
button 100, done button 101 and minimizing icon 102. A user may select any of these features by clicking thereon using a
mouse.

Print icon 96 opens a print dialog box (not shown), which allows a user to print any number of copies of Web pages
formatted by WebFormatter. Edit icon 97 opens an RTF file storing formatted Web page(s) for editing by a
predetermined RTF editor. Save icon 99 opens a save dialog box (not shown), which allows the user to name and save a
formatted Web page as an RTF file. Help button 100 provides help messages for operating WebFormatter, and Done
button 101 exits from WebFormatter. Minimizing icon 102 activates the minimizing mode of WebFormatter which was
mentioned above and which is described in greater detail below.

Figure 9C shows menus provided by WebFormatter during its operation. These menus include file menu 103, edit
menu 104 and window menu 106. File menu 103 provides "Save", "Edit" and "Print" options, the functions of which are
identical to those of Save icon 99, Edit icon 97 and Print icon 96, respectively. An "Exit" option is also provided to exit
from File menu 103. Finally, File menu 103 provides "Open HTML file" option 107. This option provides a user with the
capability to open a local HTML file; i.e., a hypermedia file resident on the user's computer such as a file saved from
NetScape, or URL files created by dragging and dropping a URL onto the Windows desktop. "Open HTML file" option
107 also provides hooks needed to open files created by other Web-file-processing products so that those files can be
formatted as RTF files and printed, saved and/or edited using WebFormatter.

Edit menu 104 provides "Paste URL" option 109. "Paste URL" option 109 pastes the contents of a paste buffer,
such as a URL address copied from a Web page, into URL field 44, as described above.

Window menu 106 provides a "Help Topics" option which provides a user with information regarding the use,
maintenance and background of WebFormatter, and an "About WebFormatter" option which provides a user with a dialog
box (not shown) containing WebFormatter's version number and copyright notice(s). Window menu 106 also includes
"Preferences" option 110. "Preferences" option 110 opens preferences dialog box 112, shown in Figure 9D.

Preferences dialog box 112 is used to configure and re-configure WebFormatter. As shown in Figure 9D,
preferences dialog box 112 includes Minimize view options 113, General options 114 and WWW Browser to use options 115.
Minimize view options 113 can be set to configure WebFormatter's graphical user interface in the minimized mode. Two
sets of options are provided. The first set includes "Print", "Edit" and "Save". These options correspond to print icon 96,
edit icon 97 and save icon 99, shown in Figure 9B. When a check mark appears in a listbox next to one of these options,
the icon for that option is displayed in the minimized mode, e.g., the print icon, the edit icon and/or the save icon. More
than one option can be selected at once. In this regard, Figure 9E shows graphical user interface 116, which is a
representative example of a graphical user interface for WebFormatter when WebFormatter is in the minimized mode.

Referring back to Figure 9D, Minimize view options 113 also include "Row" and "Stack" options. These options can
be set to display WebFormatter's graphical user interface in the minimized mode horizontally by selecting "Row" or
vertically by selecting "Stack". Only one of these options can be selected at a time. As an example of the foregoing,
graphical user interface 116 corresponds to a row of icons.

WWW Browser to use options 115 determine which World Wide Web browser is to be used with WebFormatter. As
shown, preferably NetScape, Internet Explorer and Mosaic are provided as browser options; however, other browser
options can also be provided. As might be expected, only one of these options can be selected at a time. The default
browser option is NetScape Navigator.

General options 114 include "Auto-start with browser" option 117, "Open in minimized view" option 118, "Warn
before printing more than _ pages" option 119, and "Warn before saving more than _ MBs" option 120. "Auto-start
with browser" option 117 sets WebFormatter to be invoked automatically when a Web browser is activated. If this option
is not selected (which is the default), WebFormatter is opened by double clicking on a WebFormatter icon in the
windowing environment, selecting WebFormatter from the Windows start menu, or dragging and dropping a URL from the
Web browser into the WebFormatter icon, as described in more detail above. "Open in minimized view" option 118,
when selected, opens WebFormatter in minimized mode. The default, however, is the fully-functional mode. "Warn
before printing more than _ pages" option 119, and "Warn before saving more than _ MBs" option 120 allow a user
to control the number of pages saved of a formatted document and the amount of memory space used by those pages,
respectively. The default for both of these options is for no warning to be given. As is evident, more than one of the
general options can be selected at the same time.

Preferences dialog box 112 also includes cancel button 121, which cancels a user's selected preferences, and OK button 122, which confirms a user's selected preferences.
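By way of illustration only, the settings collected by the preferences dialog box could be held in a structure such as the following C++17 sketch. All type and member names are hypothetical. Only the option labels and the stated defaults (NetScape Navigator, no auto-start, fully-functional mode, no warnings) come from the description; the defaults chosen for the Minimize view options are assumptions, since the description does not state them.

```cpp
#include <optional>

// Hypothetical names throughout; only the option labels and the stated
// defaults come from the description of the preferences dialog box.
enum class Browser { NetscapeNavigator, InternetExplorer, Mosaic };
enum class MinimizedLayout { Row, Stack };  // mutually exclusive choices

struct Preferences {
    // Minimize view options: any combination of icons may be shown.
    // The description gives no defaults for these four members, so the
    // values below are assumptions.
    bool showPrintIcon = true;
    bool showEditIcon = true;
    bool showSaveIcon = true;
    MinimizedLayout layout = MinimizedLayout::Row;

    // General options.
    bool autoStartWithBrowser = false;     // default: not selected
    bool openInMinimizedView = false;      // default: fully-functional mode
    std::optional<int> warnPrintPages;     // empty: no warning (default)
    std::optional<int> warnSaveMegabytes;  // empty: no warning (default)

    // WWW Browser to use: exactly one browser is selected.
    Browser browser = Browser::NetscapeNavigator;  // stated default
};

// Returns true when printing `pages` pages should trigger a warning.
inline bool shouldWarnBeforePrinting(const Preferences& p, int pages) {
    return p.warnPrintPages.has_value() && pages > *p.warnPrintPages;
}
```

Modelling the two warning thresholds as optional values keeps "no warning" distinct from a threshold of zero, which matches the stated default of giving no warning at all.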

As explained above, WebFormatter can be configured to enter directly into the minimized mode via Preferences dialog box 112, or a user can enter the minimized mode via minimizing icon 102 shown in Figure 9B. As also noted above, Figure 9E shows an example of graphical user interface 116 for WebFormatter in the minimized mode. Graphical user interface 116 is displayed as a floating interface while a user is exploring the Web. Thus, as a user views a Web page, the user also views graphical user interface 116. By clicking on an appropriate icon on graphical user interface 116 (which, in Figure 9E, includes icons identical in both structure and function to those shown in graphical user interface 43), the user can capture the current Web page, process and format the Web page into an RTF file, and save, edit and/or print the RTF file. Alternatively, the user can drag a URL from the Web browser and drop it into one of the icons.

A user can reconfigure WebFormatter in the minimizing mode by double clicking a right mouse button. This action causes a preferences dialog box to appear on display 2 which is identical to preferences dialog box 112. Thereafter, the user can alter the configuration of WebFormatter as desired. Should a user wish to enter the fully-functional mode from the minimizing mode, the user need merely click on maximizing icon 117 shown in Figure 9E.

Figure 10 is a flow diagram describing the operation of WebFormatter. WebFormatter is activated in step S1000. As described above, this can be done by double-clicking on a WebFormatter icon in a windowing environment. Depending upon how WebFormatter has been configured, i.e., in the fully-functional mode or the minimizing mode, either a graphical user interface similar to that of graphical user interface 43 or one similar to that of graphical user interface 116 is displayed in step S1000. For the sake of completeness, the following assumes that a graphical user interface similar to that of graphical user interface 43 is displayed in step S1000, since the default mode of WebFormatter is the fully-functional mode.

Next, in step S1001, WebFormatter is configured, as described above, via preferences dialog box 112 and options dialog box 62. This step is not necessary unless a user wishes to change WebFormatter's previously set configuration. In step S1002, document format data is input in fields 44 and 46 to 49 described above. More specifically, the user inputs a URL (or filename in alternative embodiments) into URL field 44. As described below, WebFormatter uses this information to process Web pages stored at the URL to create an RTF file based on the configuration of WebFormatter and the data input in fields 46 to 49.

In step S1003, a Web reader similar to that of Web reader 34 described above is executed. The Web reader connects to a network, such as the World Wide Web, in step S1004.

Next, in step S1005, it is determined whether a URL or a filename has been entered. As described above, in preferred embodiments of WebFormatter, only a URL may be entered. However, since alternative embodiments of WebFormatter may permit entry of a filename, a description of processing a file other than one at a URL address will be provided.

If a URL has been entered in field 44, processing proceeds to step S1006. In step S1006, the Web reader accesses the hypermedia document (e.g., a homepage) specified by the URL address. In step S1007, WebFormatter instructs the Web reader to traverse the hypermedia document. Thereafter, WebFormatter selects URL address(es) from the Web and stores the addresses in container 76. Once all desired addresses have been selected and a processing function, such as print, has been activated, WebFormatter downloads data stored at the addresses in container 76 into memory 5. WebFormatter then extracts predetermined data from the downloaded data based on the configuration information set in Options dialog box 62, and stores the extracted data in memory 5. Thus, for example, if "Text Only" option 72 in Options Window 62 is on, only text is extracted from the downloaded data. Processing then proceeds to step S1011.
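By way of illustration only, the download-and-extract portion of step S1007 can be sketched in C++ as follows. The names Fetcher, stripTags and processContainer are hypothetical, the Web reader is replaced by a caller-supplied function, and the tag stripping is a deliberately crude stand-in for the "Text Only" extraction; none of this is taken from the actual WebFormatter implementation.

```cpp
#include <functional>
#include <string>
#include <vector>

// Stand-in for the Web reader: maps a URL to the raw data stored there.
using Fetcher = std::function<std::string(const std::string&)>;

// Crude "Text Only" extraction: drops everything between '<' and '>'.
// Real HTML needs a proper parser; this is for illustration only.
std::string stripTags(const std::string& html) {
    std::string text;
    bool inTag = false;
    for (char c : html) {
        if (c == '<') inTag = true;
        else if (c == '>') inTag = false;
        else if (!inTag) text += c;
    }
    return text;
}

// Downloads each address in the container, in container order, and
// extracts data according to the textOnly setting.
std::vector<std::string> processContainer(
        const std::vector<std::string>& container,
        const Fetcher& fetch,
        bool textOnly) {
    std::vector<std::string> extracted;
    for (const std::string& url : container) {
        std::string data = fetch(url);  // download into memory
        extracted.push_back(textOnly ? stripTags(data) : data);
    }
    return extracted;
}
```

Passing the fetch operation in as a function keeps the sketch independent of any particular browser interface and allows it to be exercised with canned data.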

On the other hand, if, in step S1005, a filename for an HTML source file is entered, WebFormatter instructs the Web reader to access a first site in the file. In steps S1008 and S1009, the site is traversed and data is extracted and stored in the same manner as in step S1007, described above. Then, in step S1010, WebFormatter determines if more sites are listed in the HTML source file. If more sites are listed in the file, flow returns to step S1008, and the next site is accessed. If no more sites are present, processing proceeds to step S1011.

In step S1011, WebFormatter processes the extracted data in accordance with the previously set format information. For example, if Columns field 48 is set to multiple, the extracted data will be formatted into a document having multiple columns. The above processing is initiated by activating one of Print icon 96, Edit icon 97 or Save icon 99, and is similar to the processing described above in the first embodiment, e.g., flattening the document and formatting the document based on the formatting information. Accordingly, a detailed description thereof is omitted for the sake of brevity.

Once the documents whose URLs are stored in the container have been downloaded, formatted according to the preset formats and configurations, and converted into RTF file(s) in step S1011, in step S1012, the RTF file(s) are output. Alternatively, the RTF file(s) can be edited or saved, depending on which icon on the graphical user interface has been activated.

The invention has been described with respect to particular illustrative embodiments. It is to be understood that the invention is not limited to the above described embodiments and modifications thereto, and that various changes and modifications may be made by those of ordinary skill in the art without departing from the scope of the appended claims.

SAMPLE USER PROFILE

The User Profile is implemented in Windows .ini file format.

[Defaults]
Count=4
Title=My Daily Paper

[1]
Heading=News In Brief
Site=1
Section=Front Page
MaxLevels=5
MaxPages=10
MaxKBytes=2000
Date=today
Print=level 0
Template=1

[2]
Heading=Sports In Brief
Site=2
Section=Sports
MaxLevels=0
MaxPages=10
MaxKBytes=200
KeywordFilter="Football" AND "49ers"
Date=today
Print=level 0
Template=1

[3]
Heading=Money Matters
Site=1
Section=Business
MaxLevels=1
MaxPages=100
MaxKBytes=20000
KeywordFilter="Computer" OR "hardware" OR "Software"
Date=today
Print=all
Template=2

[4]
Heading=Sri Lanka
Site=3
Section=HotNews
MaxLevels=1
MaxPages=100
MaxKBytes=20000
Date=today
Print=leaves
Template=2
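
By way of illustration only, a profile in this format could be read with a small parser such as the following C++ sketch. The names Section, Profile, trim and parseProfile are hypothetical, and the handling of blank lines, '#' comment lines and malformed lines is an assumption rather than the behaviour of the actual Profile Editor.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

using Section = std::map<std::string, std::string>;  // key -> value
using Profile = std::map<std::string, Section>;      // section name -> keys

// Removes leading and trailing spaces, tabs and carriage returns.
std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Parses "[section]" headers and "key=value" lines; '#' starts a comment.
Profile parseProfile(std::istream& in) {
    Profile profile;
    std::string line, current;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == '#') continue;
        if (line.front() == '[' && line.back() == ']') {
            current = trim(line.substr(1, line.size() - 2));
            profile[current];  // create the section even if it stays empty
            continue;
        }
        const auto eq = line.find('=');
        if (eq == std::string::npos) continue;  // skip malformed lines
        profile[current][trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return profile;
}
```

Values are kept as strings so that entries such as the keyword filter expression pass through unchanged; numeric limits can be converted by the caller, for example with std::stoi.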

SAMPLE SITE PROFILES

#Legend:
#%W-day of the week
#%S-section part of URL

[Defaults]
Count=3

[1]
Title=San Jose Mercury News
Username=mvickram
Password=cannon
StartData=StartHeadlines
EndData=EndHeadlines
Home Page=http://www.sjmercury.com/
SectionURL=http://www.sjmercury.com/%S.htm
SectionCount=9
Section 1=Front Page
Section 2=International
Section 3=National
Section 4=Local & State
Section 5=Editorials & Commentary
Section 6=Business
Section 7=Sports
Section 8=Living
Section 9=Entertainment

[1.Sections]
Front Page=front
International=intl
National=natl
Local & State=loc
Editorials & Commentary=edit
Business=biz
Sports=spts
Living=liv
Entertainment=ent

[2]
Title=The San Francisco Chronicle
Home Page=http://www.sfgate.com/chronicle/
SectionURL="http://www.sfgate.com/cgi-bin/chronicle/article-list.cgi?%S:/chronicle/today"
SectionCount=5
Section 1=News
Section 2=Business
Section 3=Sports
Section 4=Editorial
Section 5=Datebook

[2.Sections]
News=News:MN
Business=Business:BU
Sports=sports:SP
Editorial=Editorial:ED
Datebook=Datebook:DD

[3]
Title=The Daily News
Home Page=http://www.lanka.net/lakehouse/anclWeb/dailynew/
SectionURL="http://www.lanka.net/lakehouse/anclWeb/dailynew/%W/%S.html"
SectionCount=12
Section 1=Business
Section 2=Editorial
Section 3=Features
Section 4=Foreign
Section 5=Letters
Section 6=InBrief
Section 7=HotNews
Section 8=Probes
Section 9=Military
Section 10=Politics
Section 11=Obituaries
Section 12=Sports

[3.Sections]
Business=business/intro
Editorial=editorial/final
Features=features/intro
Foreign=foreign/intro
Letters=letters/final
InBrief=inbrief/intro
HotNews=hotnews/intro
Probes=probes/intro
Military=military/intro
Politics=politics/intro
Obituaries=obiturai/intro
Sports=sports/intro
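
By way of illustration only, the substitution of the %S and %W placeholders into a SectionURL entry could be carried out as in the following C++ sketch. The function names are hypothetical, and the text substituted for %W is supplied by the caller because the legend does not state how the day of the week is written.

```cpp
#include <string>

// Replaces every occurrence of `token` in `text` with `value`.
std::string replaceAll(std::string text, const std::string& token,
                       const std::string& value) {
    if (token.empty()) return text;
    std::size_t pos = 0;
    while ((pos = text.find(token, pos)) != std::string::npos) {
        text.replace(pos, token.size(), value);
        pos += value.size();  // continue after the inserted value
    }
    return text;
}

// Expands a SectionURL template: %S is the section part of the URL and
// %W is the day of the week, in whatever spelling the site expects.
std::string expandSectionUrl(const std::string& sectionUrl,
                             const std::string& sectionPart,
                             const std::string& dayOfWeek) {
    return replaceAll(replaceAll(sectionUrl, "%S", sectionPart),
                      "%W", dayOfWeek);
}
```

For example, a template ending in "%S.htm" combined with the section part "front" yields a URL ending in "front.htm".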

APPENDIX 2

SYNTAX FOR RETRIEVAL, EXTRACTION AND PRINTING CRITERIA

Maximum levels to search: MaxLevels=<#>
    -1: to retrieve all levels
    0-n: to retrieve up to n levels

Maximum pages of the document: MaxPages=<#>
    n: final document not more than n pages

Maximum size of the document: MaxKBytes=<#>
    n: document size not more than n kilo bytes

Exclusion rules:
    Date=today | lessthan <#>
        today: retrieve only articles posted today
        lessthan <#>;n: retrieve only articles no more than n days old
    Retrieve=all | nosubdir | nothisdir | thissiteonly
        all: allow to fetch pages from other sites
        nosubdir: exclude URLs to subdirectories
        nothisdir: exclude URLs in this directory
        thissiteonly: fetch pages from this site only

Keyword search:
    KeywordFilter=<keyword> (AND | OR | NOT) <keyword>: accumulate only pages containing the combination of keywords
    KeywordRank=<#>;n: use fuzzy logic to rank pages according to keyword combination in KeywordFilter and keep top n ranked pages
    KeywordAuthor=<author>: accumulate only pages authored by author
    ExcludeType=ads | nonEnglish
        ads: exclude advertisements
        nonEnglish: exclude articles that are not in English

Flattening rules: Print=all | leaves | level=<#>
    all: include all nodes in the tree in the linear document
    leaves: include all leaves in the tree in the linear document
    level=<#>;n: include up to nth level of the tree in the linear document

Formatting rules: Template=<#>
    n: print according to default or user template number n
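
By way of illustration only, the Print and MaxLevels criteria could be interpreted as in the following C++17 sketch. The names FlattenKind, FlattenRule, parsePrintRule and withinMaxLevels are hypothetical. Accepting a space as well as "=" after "level" is an assumption made so that the spelling used in the sample user profile is also understood, and counting the starting page as level 0 is likewise an assumption.

```cpp
#include <cctype>
#include <optional>
#include <string>

enum class FlattenKind { All, Leaves, Level };

struct FlattenRule {
    FlattenKind kind;
    int level;  // meaningful only when kind == FlattenKind::Level
};

// Parses the value of a Print entry: "all", "leaves" or "level=<#>".
// Returns an empty optional for anything else.
std::optional<FlattenRule> parsePrintRule(const std::string& value) {
    if (value == "all") return FlattenRule{FlattenKind::All, 0};
    if (value == "leaves") return FlattenRule{FlattenKind::Leaves, 0};
    const std::string prefix = "level";
    if (value.compare(0, prefix.size(), prefix) != 0) return std::nullopt;
    if (value.size() < prefix.size() + 2) return std::nullopt;
    const char sep = value[prefix.size()];
    if (sep != '=' && sep != ' ') return std::nullopt;
    const std::string digits = value.substr(prefix.size() + 1);
    if (digits.size() > 9) return std::nullopt;  // keeps std::stoi in range
    for (unsigned char c : digits) {
        if (!std::isdigit(c)) return std::nullopt;
    }
    return FlattenRule{FlattenKind::Level, std::stoi(digits)};
}

// MaxLevels: -1 retrieves all levels; 0..n retrieves up to n levels.
bool withinMaxLevels(int depth, int maxLevels) {
    return maxLevels == -1 || depth <= maxLevels;
}
```

Returning an empty optional for unrecognized values leaves it to the caller to decide whether to fall back to a default rule or to reject the profile entry.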

APPENDIX 3
DESCRIPTION OF MODULES
Appendix 3A

THE PERSONAL NEWS PROFILE EDITOR MODULE

The Profile Editor manages access to the user profiles and is represented by CProfileMgr class. It also manages loading and saving of the profiles. The services provided by Profile Editor are:

BOOL NewProfile(CString fileName);
    - Creates a new profile given the file name.
BOOL OpenProfile();
    - Opens the default profile.
BOOL OpenProfile(CString fileName);
    - Opens the named profile.
CProfileEntry* GetFirstEntry();
    - Loads and returns the first profile entry.
CProfileEntry* GetNextEntry();
    - Loads and returns the next profile entry.
BOOL WriteEntry(CProfileEntry& entry);
    - Saves a new entry in the profile.

Each profile entry contains an extraction specification and an output specification as represented by CProfileEntry class. The methods provided are:

CURL GetSiteId();
    - Returns the site id contained in the profile entry.
CExtractionSpec GetExtractionSpec();
    - Returns the extraction specification contained in the profile entry. Extraction specification contains keywords for searching, limits for levels, pages, size in kilo bytes.
COutputSpec GetOutputSpec();
    - Returns the output specification contained in the profile entry. Output specification contains formatting instructions and tree traversal rules.

THE WEB READER MODULE

CWebPage class abstracts the interface to the Internet browser and is representative of the actual Web page. It will be responsible for fetching a Web page, extracting links or references to other URLs in the Web page, and maintaining the contents of a Web page. The methods provided are:

BOOL Load();
    - Fetch the Web page using the URL, username and password.
BOOL Parse();
    - Parses the data in the Web page and creates a list of links. Also resolves the relative URLs into absolute URLs.
CURLList* GetLinks();
    - Returns the list of links in the Web page.
CPageData* GetData();
    - Returns the actual text data contained in the Web page.
void FilterContent();
    - Extracts title and other information according to the site data.
CString GetTitle();
    - Returns title and other information according to the site data.
CString GetAuthor();
    - Returns the author of the Web page.
int GetSize();
    - Returns the size of the data in kilo bytes.

CNetwork class will encapsulate OLE functionality and provides communication with the Internet browser.

CString GetUsername();
    - Determine the currently set username.
void SetUsername(LPCTSTR);
    - Set the current username in the CNetwork object.
CString GetPassword();
    - Determine the currently set password.
void SetPassword(LPCTSTR);
    - Set the current password in the CNetwork object.
void Close();
    - Disconnect any active connection and reset the CNetwork object.
short Read(BSTR* pBuffer, short iAmount);
    - Read data retrieved by the Browser.
long GetStatus();
    - Query the status of the current load.
BOOL Open(LPCTSTR pURL, short iMethod, LPCTSTR pPostData, long lPostDataSize, LPCTSTR pPostHeaders);
    - Initiates the retrieval of a URL from the network.
CString GetErrorMessage();
    - Provide the caller with internally generated error messages.
short GetServerStatus();
    - Determine the error status reported by the server.
long GetContentType();
    - Return the content length (total amount of bytes) of the current load.
CString GetContentEncoding();
    - Return the MIME encoding of the current load.
CString GetExpires();
    - Return when the data retrieved by this load is no longer considered valid.
CString Resolve(LPCTSTR pBase, LPCTSTR pRelative);
    - Generate an absolute (fully qualified) URL.
BOOL IsFinished();
    - Determine if a load is complete.
short BytesReady();
    - Inform the caller of the number of bytes prepared to be read.

THE SITE DRIVER MODULE

The Site Driver will provide the site information to the Web Reader. The Site Driver is functionally similar to the Profile Editor and is represented by CSiteDriver class. Services provided are:

BOOL NewProfile(CString fileName);
    - Creates a new profile given the file name.
BOOL OpenProfile();
    - Opens the default profile.
BOOL OpenProfile(CString fileName);
    - Opens the named profile.
CSiteProfile* GetFirstSite();
    - Loads and returns the first site entry.
CSiteProfile* GetNextSite();
    - Loads and returns the next site entry.
BOOL WriteEntry(CSiteProfile& entry);
    - Saves a new entry in the profile.
int NumberOfSites();
    - Returns the number of sites specified in the profile.

An entry in the site profile will contain information about the base URL of the site, title of the news source, information about how to access the site, and various other information such as section data etc. and will be represented by CSiteEntry class. Methods provided are:

CString GetURL();
    - Returns the base URL of the site.
CString GetUsername();
    - Returns the username for the site.
CString GetPassword();
    - Returns the password for the site.
CString GetTitle();
    - Returns the title of the news source.
int SectionCount();
    - Returns

Appendix 3B

TREE MANAGER MODULE

Tree Manager will maintain the most central data structure in this program, which is a tree of Web page nodes and is represented by the CPageTree. CPageTree will traverse the WWW to retrieve the necessary Web pages according to the extraction specification and builds the tree. The methods provided are:

CPageTreeNode* GetRoot();
    - Returns the root node of the tree.
BOOL Build(CURL URL, CExtractionSpec& spec);
    - Builds the tree according to the personal news profile extraction specification.

Each node in the page tree is represented by a CPageTreeNode. Methods provided are:

BOOL AddChild(CWebPage* page);
    - Adds a child node with Web page data.
CWebPage* GetPage();
    - Returns the Web page contained in the node.
int NumberOfChildren();
    - Returns the number of children belonging to the node.
BOOL IsLeaf();
    - Returns TRUE if a leaf node, i.e., no children.

To traverse the Web page tree, a CTreeIterator class is defined with different traversal methods. Methods provided are:

void Reset();
    - Cancels the current traversal, and initializes state data.
CPageTreeNode* GetNextNode();
    - Returns the next node in the tree in a depth first search.
CPageTreeNode* GetNextSibling();
    - Returns the next node in the tree in a breadth first search.
CPageTreeNode* GetNextLeaf();
    - Returns the next leaf in the tree in a depth first search.
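
By way of illustration only, a depth first traversal that flattens a page tree into a linear list, visiting either every node or only the leaves and optionally stopping at a given level, can be sketched in C++ as follows. Node and flatten are hypothetical, simplified stand-ins and do not reproduce CPageTreeNode or CTreeIterator; each page is reduced to a string.

```cpp
#include <memory>
#include <string>
#include <vector>

// Simplified stand-in for a page tree node: a page label plus children.
struct Node {
    std::string page;
    std::vector<std::unique_ptr<Node>> children;

    // Appends a child holding `childPage` and returns a pointer to it.
    Node* addChild(const std::string& childPage) {
        children.push_back(std::make_unique<Node>());
        children.back()->page = childPage;
        return children.back().get();
    }
    bool isLeaf() const { return children.empty(); }
};

// Depth first (pre-order) walk that appends pages to `out`.
// leavesOnly keeps only nodes without children; maxLevel < 0 means no
// level limit, otherwise nodes deeper than maxLevel are skipped.
void flatten(const Node& node, bool leavesOnly, int maxLevel, int depth,
             std::vector<std::string>& out) {
    if (maxLevel >= 0 && depth > maxLevel) return;
    if (!leavesOnly || node.isLeaf()) out.push_back(node.page);
    for (const auto& child : node.children) {
        flatten(*child, leavesOnly, maxLevel, depth + 1, out);
    }
}
```

The root is passed with depth 0, so a level limit of 0 yields the root page alone.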



THE FORMATTER MODULE

Input to this module will be the Web page tree created by the Tree Manager and the output specification contained in the user profile. Formatter will traverse the tree according to the rules specified in the output specification and the final document will be formatted using the formatting instructions in the output specification and the formatting contained in the Web pages such as headings, paragraphs and lists etc.

The output document will be in Rich Text Format (RTF) and will be accessible by many applications. RTF is an advanced formatting language for text, providing document, section and paragraph formatting, style sheets, headers and footers, and with support for Unicode. Image formats supported are DIB, DDB, WMF, OS/2 metafiles. There is no support for Web images which are of the GIF format. A third party library will need to be purchased in order to do the conversion of the GIF to DIB format or one can be developed in-house.

The prototype creates a HTML file as the output.

The formatter is represented by the CFormatter class. The methods provided are:

BOOL OpenHTMLFile(CString fileName);
    - Opens the named HTML file for output.
void CloseHTMLFile();
    - Closes and saves the HTML file.
BOOL PrintHTML(CPageTree& root, COutputSpec& format);
    - Given the root and the output specification, traverses the tree and prints the contents in the Web pages in HTML format.
BOOL OpenRTFFile(CString fileName);
    - Opens the named RTF file for output.
void CloseRTFFile();
    - Closes and saves the RTF file.
BOOL PrintRTF(CPageTree& root, COutputSpec& format);
    - Given the root and the output specification, traverses the tree and prints the contents in the Web pages in RTF format.
BOOL Print(CPageTree& root, COutputSpec& format);
    - Given the root and the output specification, traverses the tree and prints the contents in the Web pages to the default printer.



Claims 



1. A method for formatting data from at least one hypermedia document comprising the steps of:

an accessing step to access the at least one hypermedia document;
a retrieving step to retrieve data from the hypermedia document into an extracted data tree, wherein the data is retrieved based on a structure of the hypermedia document;
a flattening step to flatten the extracted data tree into a linear document; and
a formatting step to format the linear document into a formatted document.

2. The method of Claim 1, further comprising the step of printing the formatted document.

3. The method of Claim 1, wherein said hypermedia document is located on the World Wide Web.

4. The method of Claim 1, wherein said hypermedia document is located on the Internet.

5. The method of Claim 1, wherein said hypermedia document is located on an Intranet.

6. The method of Claim 1, wherein said accessing step, said retrieving step, said flattening step, and said formatting step are performed in accordance with a personal-news-profile.

7. A method of creating a personal-news-profile for retrieving data from a hypermedia-linked computer network, comprising the steps of:

accessing the hypermedia-linked computer network;
entering a learning mode;
traversing sites on the hypermedia-linked computer network with commands;
extracting at least one rule from the commands; and
compiling the at least one rule into the personal-news-profile.

8. The method of Claim 7, wherein the at least one rule specifies structural characteristics of sites for traversing the hypermedia-linked computer network.

9. The method of Claim 8, wherein the at least one rule also specifies content-based criteria for traversing the hypermedia-linked computer network.

10. A personalization system for creating a personalization profile for a Web site retrieval data retrieval system, the personalization system comprising:

an input device for inputting data and commands to access the World Wide Web;
a connection to the World Wide Web;
a memory for storing a Web reader, the Web reader for accessing the World Wide Web via the connection to the World Wide Web according to commands from the personalization system; and
a processor for launching the personalization system in response to a user command, wherein the personalization system, upon being launched, (1) launches the Web reader, (2) accesses the World Wide Web via the Web reader, (3) enters a learning mode, (4) sends commands to the Web reader to traverse the World Wide Web according to user commands, (5) extracts at least one rule from the user commands, (6) compiles the at least one rule into a personalization profile, and (7) stores the personalization profile.

11. A method for retrieving articles from a hypermedia-linked computer network and for formatting the articles into a personalized newspaper, the method comprising the steps of:

retrieving a stored personal-news-profile which comprises address data for a site on the hypermedia-linked computer network, command data for accessing data from the site, and newspaper layout commands;
contacting the site based on address data stored in the personal-news-profile;
downloading articles from the site based on command data stored in the personal-news-profile;
flattening the articles into a linear document; and
formatting the linear document into the personalized newspaper according to layout commands stored in the personal-news-profile.

12. The method of Claim 11, further comprising the step of printing the personalized newspaper.

13. The method of Claim 11, wherein said hypermedia-linked computer network is the World Wide Web.

14. The method of Claim 11, wherein said hypermedia-linked computer network is on the Internet.

15. The method of Claim 11, wherein said hypermedia-linked computer network is on an intranet.

16. The method of Claim 11, wherein the command data for accessing data includes data for selecting articles based on a structure of the site.

17. The method of Claim 16, wherein the command data for accessing data also includes data for selecting articles based on a content of the articles.

18. A World Wide Web site data retrieval system for accessing at least one Web site, for retrieving data from the Web site, and for formatting the data into a personalized document, the system comprising:

an input device for inputting data and commands to access the World Wide Web;
a memory for storing a Web site data retrieval driver which includes a Web reader, stored Web site address information, stored Web site commands, and stored format information, wherein the memory also includes process steps to connect to a Web site and to issue commands within the connected Web site;
a connection to the World Wide Web; and
a processor for launching the Web site data retrieval driver in response to a user inputting a command to access the World Wide Web, wherein the Web site retrieval driver, upon being launched, (1) launches the Web reader to connect to the World Wide Web via said connection, (2) retrieves the Web site address information and Web site commands, (3) instructs the Web reader to access the Web site based on the Web site address information and Web site commands, (4) downloads Web site data from the Web site based on the Web site commands, wherein the data is downloaded with reference to a linked list so as to avoid hypermedia-links that form loops and so as to avoid repetitious downloading of data that has already been downloaded, (5) stores the Web site data in a linear document, (6) repeats steps 1 through 5 until all addresses in the stored Web site address information have been accessed, and (7) formats the linear document into the personalized document based on the format information.

19. The Web site data retrieval system of Claim 18, wherein the Web site address information, the Web site commands, and the format information stored in the memory form a personalized-news-profile.

20. The Web site data retrieval system of Claim 18, further comprising a printer for printing the personalized document.

21. The Web site data retrieval system of Claim 18, wherein the personalized document represents a personalized newspaper.

22. The Web site data retrieval system of Claim 18, wherein the personalized document represents a personalized magazine.

23. The Web site data retrieval system of Claim 18, wherein the personalized document represents a personalized book.

24. Computer executable process steps stored on a computer-readable medium, said steps for accessing World Wide Web sites for retrieving data at the sites and for formatting the data into a personalized document, said steps comprising:

a connecting step to connect to the World Wide Web;
a retrieving step to retrieve user-defined Web site address information, user-defined Web site commands, and user-defined formatting commands;
an activating step to activate a Web reader so as to access a Web site based on the user-defined Web site address information and retrieving data from within the Web site based on the user-defined Web site commands;
a downloading step to download the retrieved Web site data from the accessed Web site into an extracted data tree;
a flattening step to flatten the extracted data tree into a linear document;
a step to repeat the downloading step and the flattening step until all addresses in the user-defined Web site address information have been accessed; and
a formatting step to format the stored data into the personalized document based on the user-defined formatting commands.

25. The computer executable process steps of Claim 24, further comprising a spooling step to spool the personalized document to an output device.

26. The computer executable process steps of Claim 25, wherein the output device is a printer.

27. The computer executable process steps of Claim 25, wherein the output device is a display.

28. The computer executable process steps of Claim 24, wherein the user-defined Web site commands include commands for selecting data based on a structure of the Web site.

29. The computer executable process steps of Claim 28, wherein the user-defined Web site commands also include commands for selecting data based on a content of the Web site.

30. An apparatus for retrieving news articles from on-line news services on the World Wide Web and formatting the news articles into a personalized newspaper, the apparatus comprising:

first storage means for storing (1) a personal-news-profile which comprises addresses data and command data for accessing data from a Web site, and (2) newspaper format commands;
retrieval means for retrieving the stored personal-news-profile and accessing data stored therein;
activating means for activating a Web reader to contact a Web site based on address data stored in the personal-news-profile;
downloading means for downloading news articles from the contacted Web site based on command data stored in the personal-news-profile;
second storage means for storing the downloaded news articles; and
formatting means for formatting the stored news articles into the personalized newspaper based on the newspaper format commands stored in the personal-news-profile.

31. The apparatus of Claim 30, further comprising spooling means for spooling the personalized newspaper to a printer.

32. A method for formatting data from a hypermedia document into a personalized document, comprising the steps of:

a location specifying step to specify a location of the hypermedia document;
a type specifying step to specify the type of the hypermedia document;
a scope specifying step to specify the scope of data to retrieve from the hypermedia document, wherein the scope is based on a structure of the hypermedia document;
a format specifying step to specify a format for formatting the data retrieved from the hypermedia document into the personalized document;
an accessing step to access the hypermedia document found at the location specified in the location specifying step;
a retrieving step to retrieve data from the hypermedia document accessed in the accessing step, wherein the data is retrieved in accordance with the type specified in the type specifying step and in accordance with the scope specified in the scope specifying step; and
a formatting step to format the data retrieved in the retrieving step into the personalized document, wherein the data is formatted in accordance with the format specified in the format specifying step.

33. The method of Claim 32, further comprising a printing step to print the personalized document.

34. The method of Claim 32, wherein the location specified in the location specifying step is a filename.

35. The method of Claim 32, wherein the location specified in the location specifying step is a uniform resource locator for the World Wide Web.

36. A method of processing a hypermedia document, comprising the steps of:

accessing the hypermedia document; 

extracting addresses from the hypermedia document;

storing the addresses extracted from the hypermedia document in a container in a memory;
activating a processing function to process data stored at the addresses stored in the container; 
downloading the data stored at the addresses in the container into the memory; 

extracting predetermined data from downloaded data in accordance with predetermined configuration informa- 
tion; 

formatting the predetermined data in accordance with predefined formatting settings to generate a formatted
document; and 

processing the formatted document in accordance with the processing function. 
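
By way of illustration only, the steps of Claim 36 can be sketched in Python as follows. The names `extract_addresses` and `run_pipeline` are hypothetical. The sketch assumes that addresses are `href` attributes of anchor tags, that the configuration information is a regular expression with one capture group, and that the formatting settings are a per-item template and a separator. The `fetch` and `process` callables stand in for the downloading step and the activated processing function (for example, printing), and are supplied by the caller.

```python
import re
from html.parser import HTMLParser


class _LinkExtractor(HTMLParser):
    """Collects href values of anchor tags in document order."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.addresses.append(value)


def extract_addresses(document):
    parser = _LinkExtractor()
    parser.feed(document)
    parser.close()
    return parser.addresses


def run_pipeline(document, fetch, config, settings, process):
    # Extract the addresses and store them in a container (a list here).
    container = extract_addresses(document)
    # Download the data stored at each address in the container.
    downloaded = [fetch(address) for address in container]
    # Extract predetermined data according to the configuration information.
    pattern = re.compile(config["pattern"], re.S)
    extracted = []
    for data in downloaded:
        match = pattern.search(data)
        if match:
            extracted.append(match.group(1).strip())
    # Format the extracted data according to the formatting settings.
    formatted = settings["separator"].join(
        settings["item_template"].format(text=t) for t in extracted
    )
    # Hand the formatted document to the processing function.
    return process(formatted)


# Example use with an in-memory stand-in for the linked documents.
site = {"/a": "<title>Alpha</title>", "/b": "<title>Beta</title>"}
index = '<a href="/a">A</a> <a href="/b">B</a>'
result = run_pipeline(
    index,
    site.__getitem__,
    {"pattern": r"<title>(.*?)</title>"},
    {"item_template": "[{text}]", "separator": "\n"},
    str.upper,
)
```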

37. A method according to Claim 36, further comprising a step of previewing the formatted document prior to
processing the formatted document.

38. A method according to Claim 37, further comprising the steps of:

changing the formatting settings after previewing the document and before processing the formatted document
in accordance with the processing function;
re-activating the processing function; and

re-formatting the data in accordance with changed formatting settings to generate the formatted document.

39. A method according to Claim 36, wherein the addresses are stored in the container in the order that the addresses
are input into the container; and

wherein the processing function processes the predetermined data in the order that the addresses are
stored in the container.

40. A method according to Claim 39, further comprising the step of rearranging the addresses stored in the container
by dragging and dropping the addresses within the container. 
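
By way of illustration only, the ordered container of Claim 39 and the rearranging step of Claim 40 can be sketched in Python as follows. The function `move_address` is hypothetical. It models only the result of a drag-and-drop gesture (an address taken from one position and dropped at another), not the user interface itself, and the example addresses are placeholders.

```python
def move_address(container, from_index, to_index):
    """Move one address within the ordered container, in place.

    to_index is the position the address should occupy after the move.
    """
    address = container.pop(from_index)
    container.insert(to_index, address)
    return container


# Addresses are held in the order in which they were input.
container = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

# Drag the last address to the top; processing then follows the new order.
move_address(container, 2, 0)
```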

41. A method according to Claim 36, further comprising a step of inputting the formatting settings and configuration
information via a graphical user interface.

42. A method according to Claim 41, wherein the graphical user interface comprises plural processing icons, one of
which activates the processing function.

43. A method according to Claim 42, wherein the graphical user interface is displayed in plural modes.

44. A method according to Claim 43, wherein the plural modes comprise (1) a fully-functional mode in which the
graphical user interface displays formatting fields, processing options, menus and the processing icons, and (2) a
minimizing mode in which the graphical user interface displays only the processing icons.

45. A method according to Claim 44, wherein the graphical user interface displayed in the minimizing mode is displayed
during browsing the hypermedia document.

46. An apparatus for processing a hypermedia document comprising: 

a Web reader which accesses the hypermedia document;
means for extracting addresses from the hypermedia document;

a memory including a container which stores the addresses extracted from the hypermedia document;

a graphical user interface having processing icons which activate at least one processing function to process
data stored at the addresses stored in the container; and

processing means which (1) downloads the data stored at the addresses stored in the container into the
memory, (2) extracts predetermined data from downloaded data in accordance with predefined configuration
settings, (3) formats the predetermined data in accordance with predefined formatting settings to generate a
formatted document, and (4) processes the formatted document in accordance with the processing function.

47. An apparatus according to Claim 46, further comprising previewing means for previewing the formatted document
prior to processing the formatted document.

48. An apparatus according to Claim 46, wherein the addresses are stored in the container in the order that the
addresses are input into the container; and

wherein the processing function processes the predetermined data in the order that the addresses are
stored in the container.

49. An apparatus according to Claim 48, further comprising dragging and dropping means for dragging and dropping
the addresses listed in the container in order to rearrange the addresses in the container. 

50. An apparatus according to Claim 46, further comprising inputting means for inputting the formatting settings and
configuration information via a graphical user interface.

51. An apparatus according to Claim 50, wherein the graphical user interface comprises plural processing icons, one
of which activates the processing function.

52. An apparatus according to Claim 51, further comprising display means for displaying the graphical user interface in
plural modes.

53. An apparatus according to Claim 52, wherein the plural modes comprise (1) a fully-functional mode in which the
graphical user interface displays formatting fields, processing options, menus and the processing icons, and (2) a
minimizing mode in which the graphical user interface displays only the processing icons.

54. An apparatus according to Claim 53, wherein the graphical user interface displayed in the minimizing mode is
displayed during browsing the hypermedia document.
